BINARY MODELS FOR MARGINAL INDEPENDENCE 

MATHIAS DRTON AND THOMAS S. RICHARDSON 

Abstract. Log-linear models are a classical tool for the analysis of contingency ta- 
bles. In particular, the subclass of graphical log-linear models provides a general 
framework for modelling conditional independences. However, with the exception of 
special structures, marginal independence hypotheses cannot be accommodated by 
these traditional models. Focusing on binary variables, we present a model class that 
provides a framework for modelling marginal independences in contingency tables. 
The approach taken is graphical and draws on analogies to multivariate Gaussian 
models for marginal independence. For the graphical model representation we use 
bi-directed graphs, which are in the tradition of path diagrams. We show how the 
models can be parameterized in a simple fashion, and how maximum likelihood esti- 
mation can be performed using a version of the Iterated Conditional Fitting algorithm. 
Finally we consider combining these models with symmetry restrictions. 



1. Introduction 



In seminal work 



Anderson (1969 



1970 



19731 ) studied Gaussian models defined by hy- 



potheses that are linear in covariances. Such hypotheses include as a special case, zero 
restrictions on covariance matrices. These restrictions correspond to marginal indepen- 
dences, which may arise for example through confounding effects of unobserved variables 
(|Cox and Wermuthl . Il99a . [l99J; IPearl and Wermuthl . 1 1994 : lllichardson and SpirteslhoO^ 



For a graphi cal representation of zero restrictions on covariance matrices, 



Cox and Wermuth 



(jl993l . 1996 ) introduced covariance graphs: each variable is represented by a vertex; 
two vertices are linked by a dashed edge if the model does not set the corresponding 
covariance to zero. Dashed edges differentiate these graphs from undirected graph s, 
which represent zero hypotheses on the inverse covariance matrix (jLauritzenl . [1996). 
More recently, a number of authors have used b i-dire cted edges («-►) in place of dashed 
edges which is consistent with Sewall Wright's (|192ll ) path diagram notation; compare 
Figures [TJ [2] and U fa) below. Covariance graph models have app e ared in several differ 
ent contexts (e.g ., 



Butte et al 



2000 



Diaconis and Evansl . 



2002 



Grzebvk et all 12004 : 



Mao et all 12004 ) . Maximum likelihood (ML) estimation and likelihood ratio (LR) tests 
in these Gaussian models can be carried out using the Iterative Conditional Fitting 
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algorithm (IDrton and Richardson!. 120031: IChaudhuri et al.l . 120071 ) , which is implemented 
in the 'ggm' package in R (jMarchettil . I2006I ). 

There have been several efforts ai med a t developing binary models with analogous in- 
dependence stru cture. Kauermann (119971') uses the multivariate logistic (m-logit) trans- 
forma tion due to lMcCullaghl (|1989l ) ; McCullagh and Nelderl (| 19891 ) ; Idonek and McCullagh 
()1995l ) , which consists of selecting the highest order interaction term from every margin 
(see also iBergsma and Rudasl . l2002bl ) . Cox's (119931 ) assumes that the joint distribution 
is quadratic exponential, and then approximates marginal distributions via series expan- 
sions. An alternative approach is to use the nonparametric concept of independence in 
order to form models for categorical data that are analogous to Gaussian models. Many 
existing discrete models, such as the popular graphical log-linear models for modelling 
cond i tional independence i n contingency tables are often motivated this way (jWermuthl . 



1976; 



Darroch et al 



198(3 ). In this paper we take this route to developing a general 



framework for modelling marginal independence that is a natural counterpart to graph- 
ical log-linear models. 

For an example of a marginal independence pattern that cannot be represented using 
log-linear models but that our new models can accomodate very naturally suppose that 
we are investigating the relationship between alcohol dependence and depression. We 
have data from female mono-zygotic twins, indicating whether or not each twin is alcohol 
dependent (Ai) and whether or not they suffer from major depression (Di); see Table 
[H Consider the two graphs shown in Figure Q3 Both hypothesize that for each twin 
there are independent factors relating to individual experiences (Si) which influence 
both alcoholism and depression; however, graph (b) hypothesizes in addition that there 
is a single genetic factor which influences both traits, while graph (a) supposes that 
there is no such single factor, and that Ga, Gd, Si and S2 are mutually independent. 
Graph (b) does not imply any independence restrictions relating the observed variables, 
while graph (a) implies that 



(1) 



A 1 ALD 2 and A 2 iLDi; 



using the notation of iDawidl (|1979l ). Under (a) one twin's alcohol dependence status is 
independent of the other twin's depression status. Note that we do not make any as- 
sumption concerning the marginal distributions of the unobserved variables. In particu- 
lar, testing the hypothesis ([I]) provides a way of testing the scientific hypothesis leading 
to graph (a) without having to specify the number of levels of the possibly complex 
genetic factors . This focus on implied i ndepe ndences is in the spiri t of the work on an 



19961 ) . We remark that 



Ekholm et al 



cestral graphs ({Richardson and Spirtes, 2002) and summary graphs (|Cox and Wermuthl . 



2n06al lbh recently fit latent class models to twin 



data including those in Table [TJ The precise relationship between latent class models 
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Tabl e 1. Data on n = 597 pairs of twins; adapted from iKendler et al.1 
(|l992h . 
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Figure 1. Possible generating models. Observed variables are shaded, 
(a) Separate genes relating to Alcohol (Ga) and Depression (Gd); (b) 
a common gene (G). represents the personal experiences of twin j. 
Unobserved variables are hypothesized to be independent. 



and the marginal independence models we discuss in the sequel is an open problem, 
but if two such models can be shown to coincide then the EM algorithm provides an 
alternative method for model fitting. However, in this context it should be noted that 
there exist Gaus sian covariance graph models that cannot be parameterized by latent 



variable models ([Richardson and Spirtesl . 12002 . §8.6). 



If the variables were jointly Gaussian, then hypothesis (pQ) would restrict the ap- 
propriate two entries in the covariance matrix to zero. Hence, a likelihood ratio test 
of (pQ) could be performed by fitting the covariance matrix subject to this restriction. 
However, when the variables are binary, performing such a test is not at all straightfor- 
ward. In particular, there does not exist a log-linear model that is equal to the family 
of binary distributions obeying ([1]). In fact, the marginal independence restrictions 
correspond to complicated non-linear restrictions on the parameters of the log-linear 
expansion of the joint density of (Ai, A 2 , D\, D 2 ). The difficulty encountered here is an 
instan ce of the problem of lack of c ompatibility of margins in log-line ar parametriza- 
tions (IGlonek and McCullaghl . Il995l . p. 534); see also iMcCullaghl (jl989l ). In this simple 
example, a practical solution might be to combine separate marginal tests, but there 
would be an obvious loss of efficiency in so doing. The methods developed in this paper 
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Figure 2. (a) a bi-directed four cycle; (b) a bi-directed four chain; (c) 
graph with two disconnected components. 

allow the loss of efficiency to be avoided by providing models that capture precisely 
hypotheses like ([I]). The fitting algorithm we present allows tests that make use of all 
data available, such as LR- and x 2 -tests, to be performed. 

The remainder of the paper is organized as follows. In £}2]we describe the graphical 
representation of marginal independence patterns. This representation facilitates the 
understanding of marginal independence structures in multivariate normal distributions 
and provides the basis for our transfer of model structure to the binary case. This trans- 
fer yields models that are defined implicitly in terms of independence constraints. In 
we show that a linear change of coordinates leads to a surprisingly simple characteriza- 
tion of marginal independence. This characterization immediately yields a multilinear 
model parameterization. ML estimation in the proposed models is discussed in $4] and 
the Iterative Conditional Fitting algorithm for computing ML estimates is developed in 
$51 In the methodology is illustrated in an application to survey data. In the twin 
data example mentioned above, symmetry under permuting the labels 1 and 2 given 
to the twins is an interesting hypothesis. Combining such symmetry constraints with 
marginal independence is the topic of section £J7J We conclude in $BJ where connections 
to other work are discussed. 

2. Bi-directed graphs and marginal independence 

A bi-directed graph G = (V, E) is a graph whose edges satisfy (v, w) G E if and 
only if (w, v) G E. The edges are drawn bi-directed as v *-* w if {v , w) G E, see 
Figure [H Bi-directed graphs are special cases of the ancestral graphs cons idered in 
Rich ardson and Sp irtes and the acyclic directed mixed graphs studied in lRichardsonl 

see also lpearll (|200ol . p. 146). If a vertex w is equal or adjacent to another vertex 



v in a bi-directed graph, then w is said to be a spouse of v, and we write w G Sp(t>). 
For a set A C V, we define Sp(A) = U(Sp(u) | v G A). Note that A C Sp(^) under this 
convention. 

In graphical modelling, the Markov properties of a graph, i.e., independence state- 
ments associated with the graph, are used to define independence models for a random 
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vector X = (X v \ v € V) whose index set is identified with the vertex set V of the graph. 
The independence models associated with bi-directed graphs are based on marginal in- 
depen dence, which is manifested in the connected set Markov property of [Richardson 
J2OO3I . §4). A vertex set C C V is connected if every pair of vertices v,w € C are 
joined by a path on which every vertex is in C. The distribution of a random vector 
X = (X v I v S V) is said to satisfy the connected set Markov property if 



(2) 



X c ALX v \ Sp ( C ), 



whenever ^ C C V is a connected set. Algorithm E in lKnuthl (|1968l . p. 354) computes 
equivalence classes from a list of known equivalent pairs. This can be used to find the 
inclusion maximal connected sets in a given graph by letting the edges in the graph 
define the equivalent pairs. 

A more exhaustive Markov property is the global Markov property, which requires 
all the marginal independences in ([2]), but also additional conditional independences. 
More precisely, the distribution of X satisfies the global Markov property of G if 

(3) A is separated from B by V \ (A U B U C) in G implies X A ALX B \ X c . 

Here, A, B and C are disjoint subsets of V, and C may be empty. The separation in 
© is the usual graph-theoretic separation in which two sets A, B C V are separated 
by a third set D C V if any path from a vertex in A to a vertex in B contains a 
vertex in D. Despite the global Markov property being more exhaustive, a distribution 
satisfies the global Markov property if and only if it satisfies the connected set Markov 
property. Completeness of the global Markov pr operty for bi-directed graphs follows 
from the completeness results for ancestral graphs (IRichardson and Spirted . 120021 . Thm. 
7.6). N ote also the dual ity between ([3]) and the global Markov property for undirected 
graphs (jLauritzenl . Il996l . p. 32). 



Example 1. (Four-cycle). The bi-directed graph depicted in Figure [2|a) represents the 
two pairwise independence relations: 

X- L ALX 4 and X 2 ALX 3 

under both the connected set, and global Markov properties. This graph represents the 
independence hypothesis considered in the introductory example; compare ([T]). 

Example 2. (Bi-directed four-chain) . Consider the bi-directed graph depicted in Figure 
[2](b) . The connected set Markov property states 

X 1 1L{X 2 ,X 4 ), X 2 lL(Xi,X 3 ), X 3 ALX 2 , X4ILX1. 

The global Markov also states, for example, X±ALX 2 , X\ALX 2 \ X4, and X 2 ALX 3 \ X\. 



6 



MATHIAS DRTON AND THOMAS S. RICHARDSON 



Clearly, every singleton {v} is a connected set and thus requires that 

(4) X v ALX v \ Sp ( v y 

It follows that if the distribution of X satisfies the connected set Markov property, 
then it satisfies the pairwise Markov property which requires that X V ALX,„ whenever 



v <A w. The converse is true for multivariate normal distributions (jKauermannl . Il996l . 



Prop. 2.2) but false in general. We note that the Markov property in (|4l) oc c urs in the 



combinatorial result known as the Lovasz Local Lemma (jErdos and Lovasa . 1 19751 ) . In 
the next section we define models using the connected set (or equivalently the global) 
Markqy prop erty, and not the much less restrictive pairwise Markov property (see also 



Haber 



198 



Example 3. (Graph with two disconnected components). The pairwise Markov property 
for the graph in Figure Etc) requires X\ALX 2 and X\ALX 3 , whereas the global and 
connected set Markov property also require the stronger condition that XiAL(X 2 , X3). 
For example, consider the distribution of (X\, X 2 , X3) given by 

Pooo = 0.02, p io = 0.03, pioo = 0.05, pn = 0.10, 
poor = 0.08, poii = 0.12, p 10 i = 0.25, p m = 0.35, 

where p hi2h = P{X 1 = i u X 2 = i 2 ,X 3 = i 3 ). Then X x il_X 2 , X X ALX 3 but X 1 ^L(X 2 , X 3 ). 

We conclude this discussion of Markov properties with a lemma that provides a 
useful characterization of joint distributions of discrete random vectors that obey the 
connected set Markov property. The lemma is based on the fact that every set D C V 
that is not connected in G can be partitioned uniquely into inclusion-maximal connected 
sets C\ , . . . , Cy> , 

(5) D = C 1 UC 2 U---UC r . 
Here, the symbol U denotes a union of disjoint sets. 

Lemma 4. Let X = (X v \ v E V) be a discrete random vector X = (X v | v G V) taking 
values in the set 2. The joint distribution of X satisfies the connected set Markov 
property for a bi-directed graph G = (V, E) if and only if for every disconnected set 
D (IV it holds that 

(6) P(X D =i D ) = P(X Cl =i Cl )P(X C2 =ic 2 )---P(X Cr =ic r ), Vi€X, 
where Ci, . . . , C r are the inclusion-maximal connected sets satisfying |5]). 



Proof. If P satisfies the connected set Markov property, then it also satisfies the global 
Markov property, from which we can deduce complete independence of the subvectors 
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associated with the r connected sets in ([5]), 

(7) X Cl ALX C2 AL...ALX Cr . 

This complete independence clearly implies ([6]). 

Conversely, let C be a connected set. Then D = C(j(V \ Sp(C)) is a disconnected 
set, and © implies in particular Xc-W-Xy\ s p (c) : which is (J2j> . □ 

3. Binary marginal independence models 

Let X = (X v | v G V) be a random vector with binary components, i.e., X takes on 
values in the set 2 = {0, 1} V , and let P be the joint distribution of X. (Note that to 
keep notation simple, we will often use the same letter to indicate both a set and its 
cardinality.) For i = (i v \ v G V) G I, let 

(8) pi = P{X V = i v for all v G V) 

be the joint cell probability of i. The multivariate Bernoulli distribution of X is deter- 
mined by the vector 

(9) p = | i G 1) 

in the 2^ — 1 dimensional probability simplex A. 

Using the Markov properties discussed in the previous section we can associate an 
independence model with a bi-directed graph G = (V,E). 

Definition 5. The binary bi-directed graph model associated with G is defined as the 
family B(G) of probability distributions for a binary random vector X = {X v j v G V) 
that obey the connected set Markov property (0|) for G. 

We begin our study of the implicitly defined model B(G) by making a change of 
coordinates in the probability simplex. For ^ A C V, we call 

(10) q A = P(X A = 0) = P(X V = for all v G A) 

the Mobius parameter associated with A. If desired, q A can be viewed as a moment 
for indicator variables associated with the designated levels of the considered binary 
variables, namely, 

The 2 V — 1 Mobius parameters can be computed from the joint cell probabilities p by 
the obvious summations 



(11) 



QA= ^2 Pi, 
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where %a = (iv \ v €. A). The summations (jlip define a map fi : A -> R 2 _1 taking the 
vector of joint cell probabilities p G A to the vector of Mobius parameters 



(12) 



(QA 



^ACV). 

We call the image Q = //(A) the Mobius simplex. This simplex has the 2 V vertices t^ A \ 
A QV, where for ^ B C V the i?-th component of is equal to 



B CA, 
B<£A. 



Clearly, v- A ' is the image under \x of the distribution placing point mass on the cell 
(Oa, 1v\a)- 



Proposition 6. The linear map 

fi : A 

P 



(13) 



- Q 

» q=(qA\®^ACV) 



is bijective. Its inverse v = \x 1 : Q — > A recovers the joint cell probabilities as alter- 



nating sums of Mobius parameters. Setting 
P0 A 1v\a = P ( X A 



0) X V\A 



= 1 we have 

i) = E (-i) |JMI 



B-.ACB 



Proof. By definition of Q, the map [i is surjective. In order to verify injectivity and 
the claimed form of the inverse, define two functions and on the power set of V. 
Let = q v \ A for A C V and $(V) = 1, and = P(Xy\A = 0,Xa = 1). 

Then $(^4) = E g.gcA^(^) an d the claim follows from the Mobius Inversion Lemma 
([Lauritzenl . 0996 . p.239). □ 

The ma ps v and u may be compu ted in Od V^^I" 1 ) additions via the Fast Mobius 



Transform (Kennes and Smets, 



1991 



). ADtrees (jMoore and Led . ll998l ) provide a memory- 
efficient data-structure for storing Mobius parameters. Moreover, t he matrix for t he 
map v can be shown to have a Kronecker product structure; compare I Jokinenl ([2006). 

Example 7. (Two binary random variables). Consider two binary random variables, 
i.e., V = {1, 2}. Then the Mobius parameters are 

Qi = Poo + Pol, <?2 = Poo + Pio, q 12 = poo- 
The joint cell probabilities can be recovered as 



Poo = 912, 

PlO = Q2 - <?12, 



poi = qi- <?i2, 

Pll = 1 - 9l - 12 + <?12- 
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The Mobius simplex is defined by the linear equalities expressing that pi, written in 
terms of q, is in the unit interval [0, 1] for all i Gl. In this example 

Q = {q = (<7i,52, Qn) e [0, l] 3 : gi + <?2 - 1 < qi2 < min{gi, g 2 }} 
is a 3-dimensional simplex with vertices (0,0,0)*, (0,1,0)', (1,0,0)', (1,1,1)'. 

As we show next, the constraints defining the independence model B(G) take on a 
simple form when expressed in terms of the Mobius parameter coordinates. 

Theorem 8. A probability vector p G A belongs to the binary bi-directed graph model 
B(Gr) if and only if its Mobius parameters q = fj,(p) satisfy that for every disconnected 
set D QV, 

(14) qD = qc 1 qc 2 ---qc r , 

where C\, . . . , C r are the inclusion-maximal connected sets forming the partition |5]). 

Proof. By Lemma [H p G B(G) implies (fl4j) . Conversely, consider a vector q G Q 
satisfying (fI3J) . and let p = u(q) be the associated probability vector. We show that 
p G B(Gr) by verifying condition ([6]) in Lemma 01 We proceed by induction on the 
number of ones in the vector if) G {0,1}^ appearing in ([6]), for some D, which we 
denote by k G {0, 1, . . . , V}. 

By (|14p . the claim (|14p holds for = 0. Suppose that the claim holds for all j < k. Let 
v be such that i v = 1 in ip. Let C±, . . . C r be the partition of D into inclusion-maximal 
connected components, and suppose that v G Ce- Then 

P(X D = i D ) = P(X D \{ v y = lD\{v}) ~ P( X D\{v} = i-D\{v}iX v = 0) 

= [ p ( x c e \{v} = *cam) - p ( x c e \{v} = ic t \{v}, x v = 0)] 

r 

= Hnxc^ic,). 

The second equality follows from the induction hypothesis applied to ii)\i v \, and to 
lo = {io\{v} > 0) since both vectors contain less than k ones. Hence, we have shown that 
© holds true for all disconnected sets D QV. □ 

Example 9. {Four cycle). For the bi-directed graph in Figure 0a ) we have 13 Mobius 
parameters associated with connected sets 

qi, q2, <?4, Ql2, Ql3, 524,534, <7l23, 5124, <Zl34, ?234, 51234- 

In order to define a distribution obeying Xi_LLX4 and X2-LLX3, the Mobius parameters 
of the two disconnected sets must satisfy 914 = qiq^ and (723 = 5253- 
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Theorem [8] can be read as providing a model parametrization. Let Qc = yu(B(G)) be 
the Mobius parameter vectors defining a distribution in B(G). Let C{G) be the family 
of non-empty connected sets of G. Define Tc to be the set of vectors (qc \ C G C[G)) G 
]^C(G) Q £ Mobius parameters of connected sets for which there exists a vector q G Qc 
with q c = qc for all C G C[G). 

Corollary 10. Let vc ■ Tc — > B(G) 6e i/ie multilinear map defined by setting Mobius 
parameters of disconnected sets equal to the expression in (Ijty, obtaining a vector q G 
R ~ , and setting p = u[q) G B(G). Then vc is a bisection, and we call it the Mobius 
parametrization of the model B(G). 

Since the Mobius parameters are related via inequalities (compare Example [7|) , this 
parametrization is not variation independent, but nevertheless is useful. The definition 
of the Mobius parameters is clearly not symmetric under re-labelling of the two states 
taken by the random variables. However, such re-labelling does not change the model 
B(Gr) because it is defined purely in terms of independence relations. 

Corollary 11. The dimension of the model B(G) equals dim(B(G)) = |C(G)|, the 
number of non-empty connected sets in G. 

In contrast, the dimension of the (binary) graphical log-linear model based on the 
undirected graph with the same edges as G would be equal to the number of non-empty 
complete sets in G. Here a set A C V is complete if any two vertices in A are adjacent. 
Since every complete set is connected, the dimension of the model B(G) is always larger 
than or equal to the dimension of the corresponding graphical log-linear model; compare 
Figure El 

Corollary 12. The family 

B + (G) = {p G B(G) : Pi > for all i G 1} 

of distributions with positive joint cell probabilities in the binary bi-directed graph model 
B(Gr) forms a \C{G)\- dimensional curved exponential family. 

Proof. More precisely stated, we claim that B_|_(G) is a |C(G)|-dimensional smooth 
manifold in the natural parameter space of the exponential family formed by the interior 
of the probability simplex A°. Let Q° be the interior of the Mobius simplex Q, and 
Q G the set of vectors in Q° that satisfy the constraints (|14p in Theorem [8j Let d = 
2 V — \C(G)\ — 1 be the number of non-empty disconnected sets of G. Define the map 
h : Q° — > U. d with coordinate functions ho(q) = Qd—QCiQc 2 ■ ■ ■ Qc r : where C\, C%, . . . , C r 
form the inclusion-maximal connected set partition of the non-empty disconnected set 
D; compare ([5]). Since h is C°°, it is clear that Q° c = h' 1 ^) is aJCfG) | -dimensional 



smooth manifold in M ; compare e.g. Thm. 1 in lGeiger et al.l (|200ll ). Our claim is 
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Figure 3. Cumulative number of models per dimension for |V| = 6 
binary variables. 



now established because the diffeomorphism v maps Q° to A°, and it is well-known that 
there is a diffeomorphism between A° (mean parameters) and the log-linear parameters 
(natural parameters of the exponential family). □ 

Remark 13. Instead of using the Mobius parameters in Theorem [8j we could have 
employed the dependence ratios 

QA 



ta 



i ntrod uced bv lEkholm et al.l (|1995l ); see also lEkholm et al 



(2000 



20031 ) . and lDarroch and Speed 



(jl983l ) where such ratios occur in specifying models termed Lancaster additive. The 
ratio ta compares the probability qA computed from the joint distribution p to the 
corresponding probability under the complete independence distribution that has the 
same univariate marginals as p. Clearly, Theorem [8] also holds if we replace each Mobius 
parameter by the corresponding dependence ratio. 



4. Maximum likelihood estimation 



Assume we observe a sample of size n drawn from a distribution p in the binary 
bi-directed graph model B(G), giving rise to multinqmially distributed counts N{i), 



t£l. (For the link to Poisson sampling see lLauritzenl . Il996l . §4.2.1.) The probability 
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of observing the particular counts n(i) G No, i G I, is equal to 



(15) 



P(N(i) = n(i), i G 2) 



where we set 0° := 1. Hence, the likelihood function for the model B(G) is the map 

L : B(G) -> M, 

(16) v i i n! TT P " (t) 

Proposition 14. An MLE of p G B(G) always exists. 

Proof. As a subset of the probability simplex A, the model B(G) is bounded. It is 
also closed, hence compact, which in conjunction with the continuity of the likelihood 
function implies the claim. Closedness follows from the fact that if for two sets A,B Q 
V, Xa^-Xb under a sequence of probability distribution P n with vector of joint cell 
probabilities p n G A, then under a probability distribution P corresp onding to a limit 
point p G A of the sequence (p n ) it is also true that X A -U-Xb; compare iLauritzenl (|1996l . 
Prop. 3.12). □ 

If all counts n(i), i 6l, are positive, then an MLE of p G B(G) will actually have 
positive joint cell probabilities, i.e., lie in B+(G). An open question is when an MLE 
exists in B + (Gr) if some of the counts n(i) are zero. F or recent work on the analogous 



question in the case of hierarchical log-linear models see lEriksson et al.l (|2006l ). Another 
open problem concerns uniqueness of the MLE, i.e., can one find a graph G and (non- 
degenerate) counts n(i) such that the likelihood function of B(G) has more than one 
local maximum? 

Ignoring an additive constant the log-likelihood function for the model B(G) is of 
the form £(p) = ^iex n (0 ^°SPi- Using Proposition [UJ we can express the log-likelihood 
function also in terms of Mobius parameters as 



G 



q h+ n(0 A , 1 V \ A ) log 



ACV 



\\B\A\ 



B-.ACB 



where q$ = 1. Further, t[q) can be written in terms of the connected set Mobius 
parameters (qc \ C G C(G)) by replacing qs for a disconnected set B by the appropriate 
product of connected set Mobius parameters; see (fl4"|) . 
For two subsets A, W C V, nested as A C W, define 



w 
Pa 



P(X A = 0,X 



W\A 



I) 
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In particular, if W = V, then p\ = P(X A = 0, X v \a = 1) is a joint cell probability. 
Similarly define n\ to be the frequency of observations in which X A ■ 
Then the likelihood equations associated with the model B(G) are 



0, and X 



V\A 



1. 



d£ 
dqc 



A:Sp(C)n(A\C*)=0 



U \C\A\ n A V\Sp(C) 
~ L ) n vPA\C 

y a 



o 



for every (non-empty) connected set C in G. We prove this in the Appendix (Corollary 
[T2l) . where we also compute the second derivative of £(q), which yields the Fisher- 
information for B(G). 

Having written the log-likelihood function as a function of the para meters (qc I C £ 
C(G)\ it can be maximized us i ng gr adient-based ascent methods (see also lLang and Agrestil . 



19941 ; iBergsma and Rapcsakl . 12005). We implemented such a meth od in the statisti- 



cal programming environment R (|R Development Core Teaml . 12004 ) using the routine 
'nlm'. In doing this we found it beneficial to work with the logarithms of the parame- 
ters qc because this linearizes (HH); the examples we considered involve positive counts 
such that we may assume that qc is positive and log qc well-defined. In our experience, 
this approach works well for smaller and sparser graphs that induce a lower-dimensional 
model. However, for larger and denser graphs, such as in FigureH^a), we found an alter- 
native approach that focuses on the model-defining constraints to perform better. This 
alternative method, described in the next section, is the binary analogue to the Itera- 
tive Conditional Fitting (ICF) a lgorithm that was developed for ML fitting of Gaussia n 
marginal independence models (jDrton and Richardson! . 120031 ; IChaudhuri et all 120071 ) . 
Binary ICF plays a role dual to the Iterative Proportional Fitting (IPF) algorithm used 
to fit hierarchical log-linear models. 



5. Iterative conditional fitting 

Starting from some feasible estimate in B(G), such as the uniform distribution, the 
ICF algorithm improves a current feasible estimate by cycling through the vertex set 
V and performing an update step for each one of the vertices. At the update step for 
variable v £ V the marginal distribution P x - v of the variables — v = V \ {v} is fixed, 
and the conditional distribution p x A x -v required to determine the joint distribution 
of (X v | v G V) is estimated. This estimation is done subject to constraints that 
ensure that the newly determined joint distribution remains in the model B(G). In this 
presentation of ICF we assume that all observed counts n(i), are positive, which 

in particular entails that they were drawn from a distribution p S B + (G). Moreover, 
maximizing the likelihood function over B(G) is equivalent to maximizing it over the 
submodel B+(G), and we can assume that all joint distributions P considered in the 
sequel have positive joint cell probabilities pi > 0. In the case of zero counts, which will 
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be considered in future work, the conditional likelihood function considered in Algorithm 
5.1 is still concave but need no longer be strictly concave. Hence, the possibility of 
optima on the boundary has to be taken into account. 

For fixed marginal probability P(X- V = i- v ), the joint cell probability P(X V = 
i v ,X- v = i— v ) = 0, i 6 X, is determined by the conditional parameter 

9 v {i- v ) = 9 V (X„ V = i_ v ) := P(X V =0\X^ V = »_„). 

Let l- v = {0, Then there are |X_„| = 2 V 1 many parameters 9 v (i— v ). Notice 
that if v € D then 

P(X D = 0)= P(X B = 0,X V \ B = 1) 

B-.DCB 

(17) = Y, e v {i-v)P{X- v = i- v ). 

i-v=(0 B \{v} Av\b)£1-v ■ DCB 

In general, the binary bi-directed graph model B(G) imposes constraints on the 
conditional distribution p x A x -v _ j n order to specify the constraints in a non-redundant 
way, we focus on constraints of the form (I14p . rather than the equivalent conditional 
independence restrictions. Specifically, suppose that D is a disconnected set and that 
C is the inclusion-maximal connected subset of D containing v. By equation (|14p we 
require 

(18) P(X D = 0) = P(X C = 0)P(X D \ C = 0). 

Note that D\C may not be connected, so the model may require further factorization 
of P(Xd\c = 0). However, this only imposes a constraint on the fixed P(X- V ) margin, 
and so does not concern us here. We now express the constraint (|18p as 

P(X V = | x DX{v} = 0)P(X D \ {V} = 0) 

(19) = P(X V = | X C \ {V} = 0)P(X C \ {V} = 0)P(X D \ C = 0). 

(It is implicit here that if C \ {v} = then the second term on the right hand side is 
omitted.) Observe that only the first terms on each side depend on 6 V {-). Using (fT7|) . 
the first term on the left hand side of (|19|) may be expressed as 

P(X V = | X D \ {v} = 0) 

P(X v = 0, Xy\ D = j I X D \ {V} = 0) . 

(20) = ]T e v (x V \ D = j,x D \ {v} = o)P(x V \ D =j\x D \ {v} = o). 
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Similarly, the first term on the right hand side of (119p may be expressed as 

P(X V = | X C \ {V} = 0) 

(21) £ 0v{Xy\ C = j, X C \ {v} = 0)P(X V \ C = j I Xc\ {v} = 0). 
je{o,i}v\c 

Now, if the set D \ {v} was connected, then C and D\C would also be connected, 
contrary to the assumption. Since in the ICF algorithm we assume that all constraints 
on the marginal distribution of Xy^ v y hold, it follows that 

P(X D \ {V} = 0) = P(X D \ C = 0)P(X C \ {V} = 0). 

(Again, both C \ {v} and D\C may not be connected, so these terms may factorize 
further.) Since these terms are non-zero, they cancel from both sides of (|19p . leaving 
the constraint 

(22) ]T e v (j, o D \{v})p(x V \d = j I x DX{v} = o) = 

je{o,i} v \ D 

Yl 0vV> C \{v})P(Xy\ C = j I X C \ {V} = 0). 

je{o,i} v \ c 

It is important to note that for fixed margin P x ~ v the constraints (|22p are linear in 
the conditional parameters 6 V . The full set of constraints on the 9 V parameters may 
be obtained by considering every disconnected set D containing v and identifying the 
inclusion-maximal connected set C C D containing v. 
Let 

D v = {D : D C V, v G D, D is disconnected} . 

For each set D € D v , we define C V (D) to be the inclusion- maximal connected subset 
of D containing v. The disconnected sets D v and the connected sets C V (D) can be 
computed in preprocessing. Then the ICF update for vertex v can be implemented as 
follows. 

Algorithm 15. Update step in Iterative Conditional Fitting. 

Input: A probability vector p £ B(G) and vertex v. 

Output: A probability vector p G B(G) such that L(p) > L(p). 

Step 1. Construct the D v x constraint matrix A = (a rs ), where for each pair 
(D r ,j s ) € D v x we set 



a rs = P(X v \ Dr = {j s )v\D r I X Dr \ {v} = 0)l{(js)D r \{v} = 0} 

-P(Xy\C v (D r ) = Us)v\C v (D r ) I ^CV,(D r )\M = Q)^{(js)c v (D r )\{v} = °}- 



16 



MATHIAS DRTON AND THOMAS S. RICHARDSON 



Here all probabilities are computed under the distribution induced by the probability 
vector p, and I(-) is the indicator function. 

Step 2. Maximize the strictly concave conditional log-likelihood function 

n(i- v , 0) log 0(i- v ) + n(i- v , 1) log{l - 9(i- v )} 

subject to the linear constraints AO = 0, where 9 = (9(i- v ) \ i_„ G I- v ) is the vector 
of all conditional parameters. (If all counts are positive, the inequality constraints 
9 € [0, l] x - v need not be considered explicitly.) 

Step 3. Use the solution 9 V from step 2 to compute the new probability vector p € B(G) 
via 



Pi v i- V = P{X V = i v ,X. 



V 



&v(iv) P{X- V = i v ) if i v = 0, 

[l-9 v (i v )]P{X- v = i- v ) iii v = l. 



The optimization problem in step 2 of the ICF update algorithm has a unique local 
maximum and is n ot difficult to so lve. For example, one can employ the gradient 



projection method (jBertsekasl . 11999, §2.3), which performs a line search along the 
direction of the gradient projected on the kernel of A. A line search based on the 
Armijo-rule ensures convergence of the gradient projection method. The computation 
of the 2 V ~ 1 x 2 v ~ l projection matrix / — A \AA')~ l A requires the inversion of the 
D v x D v matrix AA' which is of full rank. However, the projection matrix has to be 
computed only once in order to solve the optimization problem in step (4) of Algorithm 
PT5l Since the Hessian of the conditional log-likelihood function maximized in step 2 of 
Algorithm [15] is diagonal it is also feasible to employ second derivative information in 
a projected Newton method, in which 9 is scaled by the matrix with diagonal elements 
equal to one over the square root of the diagonal elements of the Hessian. Since the 
Hessian depends on 9, the projection matrix in a projected Newton method has to 
be recomputed every time 9 is updated. However, based on our experience with our 
implementation of ICF in R, employing the Hessian information is beneficial. 

Having tackled the individual ICF updates we can run ICF from a feasible starting 
value. The algorithm then produces a sequence of feasible estimates whose accumulation 
points are solutions to the likelihood equations. In fact, the sequence is guaranteed to 
converge if there exist only finitely many solutions to the likelihood equations. These 
convergence guarantees follow f rom g eneral results about iterative partial maximization 



algorithms (jDrton and Eichlerl . 120061 . Appendix), of which ICF is an incarnation. Here, 
'partial maximization' refers to the fact that the update step for vertex v in Algorithm 
n~5l maximizes the log-likelihood function i(q), q = (qc \ C € C{G)) partially, namely 
when varying only components qc for which C is a connected set containing vertex 
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v. Components qc for connected sets not containing v remain fixed at the current 
estimates. 

In the above we proceeded vertex-by-vertex and est imated the univar i ate co nditional 



distribution of X v given X- v . In the Gaussian case, IChaudhuri et al.1 (j2007l ) describe 
how to run the ICF algorithm with multivariate updates. In this variant, one chooses 
complete vertex sets C QV and estimates, for fixed margin of X_c = -^V\c> the (mul- 
tivariate) conditional distribution of Xc given X_c under the marginal independence 
constraints. Such multivariate updates are also possible in the binary case discussed 
here. Let 

6 c {i-c) = 0c(X-c = i-c) := P(X C = | X^c = i-c). 
Let 1-c = {0, l} y -\ c \. Then CED becomes 

(23) P(X D = 0)= Yl 9c{i-c)P{X-c = i-c). 

i-C=(0 B \ {o} ,l v \ B )£l- C ■ DCB 

Since the set C is complete there are no equality constraints among the Mobius param- 
eters qA, ^ A C C, and one can proceed similarly as in the discussion leading up to 
([22]) to devise an analog to Algorithm [T5l with multivariate updates over complete sets. 

6. Example: Social survey data 
Sociologists and political theorists have long been interested in the relationship be 



tween trust in social institut ions and trust in other members of society (jPutnaml . 120021 : 



Sztompkal . l2000l : iLevil . Il998l ). Here, as an illustration of an exploratory analysis using 
binary independence models we examine seven questions relating to trust that are taken 
from the U.S. General Social Survey during the years 1975-94: 

• Trust 

Generally speaking, would you say that most people can be trusted or that you 
can't be too careful in life. (Can Trust; Cannot Trust; Depends) 

• Helpful 

Would you say that most of the time people try to be helpful, or that they are 
mostly just looking out for themselves? (Helpful; Lookout for Self; Depends) 

• Confidence in institutions 

/ am going to name some institutions in this country. As far as the people 
running these institutions are concerned, would you say you have a great deal of 
confidence, only some confidence, or hardly any confidence at all in them? (A 
Great Deal; Only Some; Hardly Any) 

ConClerg: Organized religion 

ConLegis: Congress 

ConBus: Major Companies 
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/Mem A / 


ConA 


( ConA 


(b) 


/Mem.Y 


UJnion/ \ 


v Bus./ 


Vlegis./ 


VUnion/ 




(a) 

Figure 4. Analysis of Trust data, (a) Marginal independence model; 
dashed edges correspond to pairwise odds ratios less than one. (b) Clas- 
sical graphical log-linear model. 

• Membership of organizations 

Here is a list of various organizations. Could you tell me whether or not you 
are a member of each type? (Yes; No) 

MemUnion: Labour unions MemChurch: Churches 

There were 13, 486 individuals who gave valid responses to all of these questions. For 
the purposes of illustration, for the questions relating to confidence in institutions we 
combine 'Some' and 'Hardly Any' to form a 'No' response; similarly for Trust we 
combine 'Cannot trust' with 'Depends' to form a 'No' group, and for Helpful we 
combine 'Take advantage' with 'Depends' to form a 'No' group. The counts are displayed 
in Tabled 

Using ICF in a backward stepwise selection we found the graph shown in Figure 
E|a). Assuming that the data in Table [2] arose in multinomial sampling, we obtain a 
deviance of 32.67 over 26 degrees of freedom, when compared to the saturated model of 
no independence. Using an asymptotic ^-approximation a p- value of 0.172 is obtained 
and the model is found not to be contradicted by the data. Since some expected cell 
counts are small, the asymptotic approximation should be treated with some caution. 
In the selected model all variables are marginally associated with confidence in business, 
but it is interesting that confidence in congress is marginally associated only to the two 
other confidence variables. Similarly, union membership is not marginally associated 
with additional variables other than church membership and confidence in business; the 
graph implies 

ConLegis _LL Helpful, Trust, MemUnion, MemChurch, and 
MemUnion _LL Helpful, Trust, ConLegis, ConClerg. 

It is perhaps of little surprise that in the fitted distribution the marginal odds ratio 
between MemUnion and ConBus is less than one; it equals 0.83. Except for the odds 
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Table 2. Data from the U.S. General Social Survey relating to Trust. 
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Table 3. ML estimate of the joint distribution under the symmetry 
group 5t w i n ; empirical distribution in parenthesis. 

D x = Di = 1 

D 2 = D 2 = l ~d7=0 D 2 = l 

_ 0.4824 0.1441 0.1441 0.0854 

2 ~ (0.4824) (0.1340) (0.1541) (0.0854) 

1 ~ _ 0.0193 0.0142 0.0092 0.0159 

A 2 = 1 

(0.0251) (0.0151) (0.0117) (0.0168) 

0.0193 0.0092 0.0142 0.0159 

j 2 ~ (0.0134) (0.0067) (0.0134) (0.0151) 

1 ~ 0.0050 0.0050 0.0050 0.0117 
Ai = \ 

(0.0050) (0.0034) (0.0067) (0.0117) 



ratio between MemUnion and MemChurch, which is equal to 0.85, all other fitted 
pairwise odds ratios are greater than or equal to 1. 

For purposes of comparison, in Figure [M b) we in c lude a classical graphical log-linear 
model obtained using the MIM program (jEdwardsl . 12000 ) by backward stepwise selec- 
tion, among all undirected models. This model has a deviance of 87.62 over 88 degrees 
of freedom. When comparing the undirected and bi-directed models it is quite striking 
that the undirected model contains one more edge, yet 62 fewer parameters. Observe 
that in the undirected graph union membership is also adjacent to the variables relating 
to confidence in clergy and whether or not people are helpful. We remark that latent 
variable models could be used for further analyses of these data. 



7. Independence and symmetry 

In this section we demonstrate how symmetry can be incorporated in the marginal 
independence models proposed earlier. The issue of symmetry naturally arises for the 
twin data shown in Table [1] in the introduction. Recall that we observe four binary 
indicators which inform us about each twins' alcohol dependence (Ai) and depression 
status (Di). When inspecting Tabled! one notices that counts related by exchanging 
the index labels 1 and 2 are often very similar. 

Let S be a group of permutations on the index set V. The group S acts on the set 
of elementary joint events X by permuting the components of i G X. In other words, for 
a G S and i = (i v \ v G V) G X, we define a(i) = (i a ( v ) \ v 6 V). This action induces 
the symmetry model 

B(S) ={ p eA\ Pi = Pa{i) W g S}. 
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Example 16. (Twin data). The symmetry group 



(24) 



Stwin = {(^l)(40(£>i)OD 2 ), (A 1 A 2 ){D l D 2 )} 



represents symmetry when exchanging vertex A\ with A 2 , and at the same time ex- 
changing D\ with D 2 . This symmetry corresponds to irrelevance of the labels given to 
the two twins. 

Since the symmetry model is a linear exponential family, the MLE may be computed 
by simply averaging the empirical cell counts over the orbit induced by the group action: 



where S(i) = {cr(i) \ a G S} is the orbit of cell i under the group S, n(j) is the empircal 
count for cell j, and n is the total sample size. For the twin data the ML estimate 
is shown in Table The deviance is 4.62 on 6 degrees of freedom, indicating a good 
fit. We now turn to testing the marginal independence hypothesis mentioned in the 
introduction, in conjunction with symmetry. 

A permutation a G S induces a new graph G a by renaming vertex v G V to a(v) G V. 
In other words, the graph G a has the same vertex set V as the original graph G = (V, E) 
but there is an edge v <-> w in G a if and only if there is an edge a _1 (v) <-> o~~ 1 (w) in the 
original graph G. We say that a group of permutations S leaves the graph G invariant 
if G a = G for all a E S, in other words, iS is a subgroup of the automorphism group of 
G. It follows that no new independences are introduced when imposing symmetry on 
the distributions in B(G). We will restrict attention to this case in what follows. 

Example 16. (continued). Let G be the graph displayed in Figure EJa), under the 
variable-vertex correspondence (1,2,3,4) = (A±, A 2 , D%, D 2 ) the independence pattern 
is A\ALD 2 and A 2 ALD\. The group <5>t w i n given in (j24"j) leaves G invariant. 

Theorem 21. If the symmetry group S leaves the graph G invariant, then a distribution 
p € B(Gr) is in the symmetry model B(«S) if and only if the Mobius parameters q G Qg 
forp satisfy that qc = Qa{c) f or a ^ connected sets C in G. 

Proof. First, note that under the assumed invariance of the graph, a set C C V is 
connected in G if and only if cr(C) is connected for all a G S. 

Consider p G B(G) n B(<S), and let C G C(G) and <r G S. Since ic* = cr _1 (i) (T ((7) we 
obtain that 




qc= Yl Pi= Yl P*- 1 (i)= Yl Pj = <la(C)- 

iel:i c =0 ieX:ic=0 ieX:i CT ( C )=0 
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Table 4. ML estimate of the joint distribution under A\ALD 2 and 
A 2 ALD\ together with the symmetry group 5 tw i n ; empirical distribution 
in parenthesis. 

Di=0 D-i = 1 



At = 



A-l 



A 2 = 



A, 



A 2 = 



A-> 



D 2 = D 2 = l D 2 = D 2 = l 

0.4612 0.1486 0.1486 0.0957 

(0.4824) (0.1340) (0.1541) (0.0854) 

0.0249 0.0204 0.0057 0.0104 

(0.0251) (0.0151) (0.0117) (0.0168) 

0.0249 0.0057 0.0204 0.0104 

(0.0134) (0.0067) (0.0134) (0.0151) 

0.0100 0.0038 0.0038 0.0054 

(0.0050) (0.0034) (0.0067) (0.0117) 



Conversely, assume that the Mobius q G Qg satisfy that qc = Qa(C) f° r au C £ C(G). 
Let DC Vbe disconnected and uniquely partitioned into inclusion-maximal connected 
sets as D = C1UC2U . . . UC r . Then the unique decomposition of cr(D) into inclusion- 
maximal connected sets is given by 

a(D) = CT(C7 1 )Ufj(C 2 )LJ • • • Ua(C r ), 

which implies that 

Qa(D) = Qa(d)Qa(C2) ' ' ' Qa(C r ) = QCi qC 2 ■ ■ ■ QC r = QD- 

Now consider i = (0a,Iv\a) G Then a(i) = (Q a (A)i IvWA))- Using Proposition [61 
we obtain that 

r*®= E (-i) |BV(A)l ^= E (-^ HB)XA ^- HB) =Pi. 

B:a(A)CB B:ACcr- 1 (B) 

□ 

For a subset C C V, let 5(C) = {o-(C) | cr G 5} be the orbit of C. 

Corollary 22. If the symmetry group S leaves the bi-directed graph G invariant, then 
the dimension of the marginal independence model with symmetry is 

dim(B(G)nB(5))= Yl V|S(C7)|. 

0^c*eC(G) 

Proof By dividing through |i?(C)|, every orbit of connected sets is counted once. □ 
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Corollary 23. If the symmetry group S leaves the bi-directed graph G invariant and 
the marginal independence model with symmetry is restricted to the interior of the prob- 
ability simplex, then one obtains the curved exponential family B+(G) nB(«S). 

Proof. The proof is analogous to the proof of Theorem 5.8. □ 

We define fig = n -pg to be the fitted cell counts under the symmetry model S, which 
are simply the group averaged cell counts. ML fitting of the model B(G) H B(<S) may 
be performed by simply applying ICF for fitting B(G) to hs rather than the observed 
cell counts. The rationale for this is as follows: let £g(p', { n W}) indicate the likelihood 
for B(G), evaluated with counts {n(i)}. Up £ B(G) D B(«S), then C G (p; {n(i)}) = 
£G(j>;{ns(i)})- Further, £ G (p; {hs(i)}) = Cg(ct(p); {ns (*)}), for any a G S, where we 
define a{p(i)) = p(a(i)). Thus the likelihood surface of the independence model B(G) 
given the group-averaged counts hs is invariant under permutations a 6 S applied to 
probability vectors p. It then follows that if p* is a local maximum of the likelihood 
function Caip; {hs(i)}), then so is cr(p*), for any a € S. Further p* and a{p*) are 
in the same contour of the likelihood function. Consequently if there is at most one 
local maximum of the likelihood function Cc{p\ { n s(i)}) m any given contour, then 
p* = o~(p*) for all a € S. Thus a maximum found by ICF when applied to hg, is 
in B(«S), and is thus a maximum of the likelihood for the model of symmetry and 
independence. 

Example 16. {continued). Applying ICF to fit the model A1ALD2 and A2-LL-D1 for 
the twin data, using the fitted counts from the symmetry model <Stwin resulted in the 
fitted distribution shown in Table 01 The combined model has a deviance of 16.156 
on 2 degrees of freedom, taking the symmetry model given by iStwm as the alternative. 
The corresponding p-value of 0.0003 indicates a poor fit and we may safely reject the 
generating hypothesis represented by the graph in Figure [T^a). 

The approach taken here to c ombin ing symmetry and independence is analogous to 



that of lAndersson and Madsenl (119981 ) in the Gaussian case. A more general approach 
would be to apply a symmetry group directly to the Mobius parameters, possibly with 
the restriction that orbits should only contain parameters c orresponding to sets of a 
given cardinality; this would be more analogous to the work of lHoisgaard and Lauritzen 



8. Related Work and Discussion 
Se veral other authors have made use of the Mobius decomposition or similar schemes. 



Led ()1993l ) used this decomposition to generate rando m binary vectors with fixed mar 



ginal distributions and specified degrees of association. lEkholm et al.l (|1995l . l2000i . 120031 ) 
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used dependence ratios (see Remark [T3l) to build association and regression models for 
multivariate discrete responses. Though Ekholm et al. did not study marginal indepen- 
dence models per se, their work on regression models offers one approach to building 
marginal independence models for mixed continuous and discrete variables, which is an 
open problem f or fut ure work. 

Kauermann (|1997l ) developed a parametrization for marginal independence models 
using the multivariate logistic (m-logit) transformation, which selects the highest order 
interaction term from every margin. However, the transformation from m-logit parame- 
ters to cell probabilities cannot, in general, be computed in closed form. Further, unlike 
classical log-linear parameters, the valid m-logit parameters may form a complicated 
subset of IR 2 ^ -1 and are not in general variation independent. The m-logit parame- 
terizati on is a special case of the marginal log- linear framework of iBergsma and Rudas 
(l2002bl ). In certain cases, such as for Figure [2](a) , there may exist a marginal log- linear 
parameterization for a marginal independen c e mode l in which the parameters are var i- 
ation independent; see IBergsma and Rudaa (|2002bl ). iLupparelli and Marchettil (|2005l ). 
However, there are models for which this approach does not appear to lead to vari- 
ation independent parametrizations. Specifically, there d oes not appear to be such a 
parametrization for the bi-directed chordless five cycle; see IBergsma and Rudasl (j2002al ) 
for related discussion. 

As stated earlier, the problems inherent in expressing marginal independence con- 
straints in terms of a log-linear parametrization over a larger set of variables are part of 
the general problem of 'lack of upward compatibility': specifically, a log-linear two-way 
interaction expresses a property of the full joint distribution, and not of the relevant 
two-way margin. A number of schemes have been pr oposed for deal i ng with this prob- 
lem, in addition to the m-logits mentioned above: see lip et al.1 (|2003l ): IStreitbergl (|1999l . 



1990). These provide alternative parametrizations for the binary bi-directed models 



in troduced h ere, which may be computed from th e fitted distribution, if desired. 



Cox 



(|1993l ) and lCox and Wermuthl (|1994l . ll996l ) take a different approach to the prob- 
lem of modelling independence structures similar to Gaussian covariance models. They 
focus on the quadratic binary exponen tial distribution, also known as the Boltzm ann 
machine (jHinton and Sejnowskil . Il983l ) or the auto-logistic scheme (|Besael . 1 19741 ). In 
this distribution, the absence of a given interaction term does not imply exact marginal 
independence, but by approximating the marginal distributions via series expansions, it 
is possible to gauge the size of any such dependence. As Cox notes, the extent to which 
such marginal approximations are reasonable will depend on the size of the relevant 
interaction terms. 
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Appendix: Likelihood Equations and Hessian calculations 

If G is a bi-directed graph with vertex set V, then for an arbitrary subset AC. V, let 

[A]g = {C | C is a maximal connected component of Ga}- 

Note that [A]g forms a partition A = Uce[A] G C- For disconnected sets D C V this 
partition is the one used in Theorem [HJ Since for a connected set C C V the family 
[C]g only comprises one set, namely C itself, we have that under a joint distribution in 
the model B(G), 

QA = IJ ?o, AC V. 

ce[A] G 

Hence for any set A, there is a unique expansion of the joint cell probability p\ in terms 
of the parameters qc for connected sets C in G, 

p v a= e (-!) |iAA| n 

B.ACB C:C€[B] G 

recall that p\ = P(Xa = 0,Xy\A = !)• We call this last expression the expansion for 
p\ (under graph G). 

Lemma 10. If C is a connected set in the graph G, then the parameter qc appears in 
the expansion for p\ if and only if Sp(C) H (A \ C) = 0. 

Proof. If Sp(C)n(A\C) = then CU(^4\C) forms a disconnected superset of A in which 
C is a maximal connected component. Hence C G [CU(A\C)] G - If Sp(C) n (A\C) / 
then there is a vertex a € A \ C such that a E Sp(C). Hence, in any set B containing 
A and C, there is a maximal connected set C 5 C U {a}. Hence C ^ [-B]g for any 
B 72 A. □ 

In words, Lemma [10] states that parameter qc appears in the expansion for p\ if 
and only if every vertex in A that is adjacent to C is already in C. Consequently, 
{d/dq c )p\ = for any connected set C in G that satisfies Sp(C) n (A \ C) ^ 0. 
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Lemma 11. // Sp(C) D (A \ C) = tfien 

dq c ~ { ' PA \° ■ 
Proof. The claim holds since Sp(C) D (A \ C) = iff (A \ C) C \ Sp(C)), and 

m = y. (-d |(buc,ui n «• 

y ° B:(A\C)CB C*:C*e{B] G 

and_BCy\Sp(C) 

= (-i) |c \ ai e (- i ) |bxa| n ^ 

£:(A\C)C.B C*:C*e[B] G 
andBCV\Sp(C) 

= (-l) |CVA| ^c P(C) - □ 

Corollary 12. XTie system of likelihood equations associated with the model P(G) con- 
tains an equation 



£ (-i) |CV4| >mc p<c, = » 



^ C A:Sp(C)n(A\C)=0 Pa 

/or every (non-empty) connected set C in G. 

The likelihood equations can also be expressed in terms of expectations with respect 
to conditional empirical measures (provided these exist): 



IF 

-XV\(Sp(C)\C) \ X Sp(C)\C=l 



(-l)^c^p(j Sp(c) |l nSp(c) ) 1 







where JEx n(Sp(c)NC) |x Sp(c)NO =i is expectation w.r.t. the measure on lv\(S P (C)\C) given 
by (normalizing) the empirical frequencies in the sub-table in which -X"sp(C)\C = 1- 

Lemma 13. Let C and C be connected sets in G. 

(i) // (Sp(C) n(A\ C)) U (Sp(C) D(A\ C)) ^ 0, then the second derivative 

9 9 i V n 
log Pa = 0. 



<9<?c <9gc 

(ii) // (Sp(C) n (A \ C)) U (Sp(C) n (A \ C)) = and C n Sp(C) + 0, iften 

9 9 ln __v _ r n |c\A| r n ic\Ai nsp(c) nsp(c) 1 

(iii) // (Sp(C) n (A \ C)) U (Sp(C) n (A \ C)) = and C n Sp(C) = 0, iften 



logpY = (-i)\(cuc)\A)\ nsp(cuc) _ J_ 



f_n|C\A|/_ n |C\A| V\Sp(C) V\Sp(C) . _J_ 
y L > y L > P A\C Pa\C (p^) 2 ' 



BINARY MODELS FOR MARGINAL INDEPENDENCE 



27 



Proof. This follows from Lemma fTTl The first term on the RHS of the equation in (iii) 
occurs if the derivative of (d / dC)p / ^^'^ is non-zero, which requires C C V \ Sp(C) 
and Sp(C") n(i\(CU C)) = 0. The second condition is implied by Sp(C) n (A\C) = 0. 
The first is equivalent toCfl Sp(C) =0. □ 

In words, the condition that C (1 Sp(C) = requires that C and C are disjoint and 
there is no vertex in C adjacent to a vertex in C . Note that Sp(C) fl C = if and only 
if Sp((7) fl C = 0, hence the conditions in (ii) and (iii) are symmetric in C and C as 
required. 

The full Hessian may be obtained by summing the expression given in the last Lemma 
over all sets A C V. 
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